FungiProteomeDB: a database for the molecular weight and isoelectric points of the fungal proteomes

Abstract Proteins’ molecular weight (MW) and isoelectric point (pI) are crucial for their subcellular localization and subsequent function. These are also useful in 2D gel electrophoresis, liquid chromatography–mass spectrometry and X-ray protein crystallography. Moreover, visualizations like a virtual 2D proteome map of pI vs. MW are worthwhile to discuss the proteome diversity among different species. Although the genome sequence data of the fungi kingdom improved enormously, the proteomic details have been poorly elaborated. Therefore, we have calculated the MW and pI of the fungi proteins and reported them in, FungiProteomeDB, an online database (DB) https://vision4research.com/fungidb/. We analyzed the proteome of 685 fungal species that contain 7 127 141 protein sequences. The DB provides an easy-to-use and efficient interface for various search options, summary statistics and virtual 2D proteome map visualizations. The MW and pI of a protein can be obtained by searching the name of a protein, a keyword or a list of accession numbers. It also allows querying protein sequences. The DB will be helpful in hypothesis formulation and in various biotechnological applications. Database URL https://vision4research.com/fungidb/


Introduction
Fungi are one of the most prominent non-vascular, nonphotosynthetic heterotrophic organisms on the earth that play an essential role in food, health, medicines and biocontrol application (1)(2)(3). The diversity of the fungi ranged from 500 000 to 9.9 million (4). They are the highly abundant organisms on the earth due to their small size and potential to withstand diverse ecological conditions (5,6). The fungi species have dominated the world from arctic polar to tropical habitats (7)(8)(9). Due to their varied genomic and evolutionary plasticity, they are used in various beneficial applications, from enzyme technology to the food industry (10,11). Some of the fungi are also highly infectious and cause severe diseases in plants and animals (12,13).
Due to their enormous biological importance, Joint Genome Institute launched a research project named MycoCosm, which targeted sequence 1000 genome (https://mycocosm.jgi. doe.gov/mycocosm/home/1000-fungal-genomes) of the fungi kingdom. Similarly, to date, the fungal genome database (DB) reported 256 genome sequences of the fungi. The National Center for Biotechnology Information (NCBI) also reported 11 940 fungal genome sequences (https://www.ncbi. nlm.nih.gov/genome/browse#!/eukaryotes/). The Ensembl-Fungi DB (https://fungi.ensembl.org/index.html) has reported the genome sequences of 1506 fungal species. This shows that researchers are trying to sequence more fungal genomes to give us better genomic information about the presence of novel genes and proteins and their biological implications. The presence of a large number of the fungal genome can also help us to understand their evolutionary perspective and adaptation to the diverse ecological niche so that we can use them in various biotechnological applications.
The genome sequence data of fungi have progressed enormously over 5-10 years (14,15). It allowed us to get the annotated protein sequences of the respective genes (16,17). The availability of advanced next-generation sequencing technology and powerful proteomics technology, specifically liquid chromatography-mass spectrometry (LC-MS)-enabled highthroughput protein identification and functional assignment tool, led us to identify several novel functional proteins (18,19). Proteomics has the potential to identify information regarding protein identity, localization and post-translational modification (20,21). The in silico gene model prediction from the genome and its availability of translated protein sequence becomes integral to large-scale 'omic' study.
Furthermore, the systems biology approach has enabled us to understand the complex interactions of proteins with other proteins and various other interacting biomolecules (22)(23)(24).
Enormous efforts are being made for optimal protein extraction from fungi and establishing details of proteomics data relative to their types and abundance (25). Also, considerable efforts are made to catalog proteins from mycelial, secreted or organellar origin across the range of fungi kingdoms (26,27). Different approaches, including sodium dodecyl sulfatepolyacrylamide gel electrophoresis (SDS-PAGE) or 2D-PAGE fractionation, or the shotgun approach, were used to identify specific proteins using tandem LC-MS to generate the protein data (28,29). In a study using the 2-DE gel electrophoresis approach, the effect of carbon source, an antifungal drug and gene deletion was explored at the proteomic level (30,31). The most important aspect of the fungal proteome research is associated with annotated protein sequences as 'predicted protein' or 'hypothetical protein', making obtaining pertinent information on proteomics data challenging. Ijaq et al. mentioned them as known-unknown proteins (32). However, to have uniformity, it would be better to say them as 'protein of unknown function' rather than a hypothetical or predicted protein or protein of known-unknown function.
Because the protein is identified and annotated, it exists and is no longer a hypothetical protein. Assigning suitable functions to these 'proteins of unknown function' will be an essential aspect of fungal proteomics. However, it is challenging to deduce the function of all the proteins with unknown functions. However, the basic characteristics of a protein can be elucidated from its molecular weight (MW) and isoelectric point (pI). The protein gets separated in the 2D-PAGE based on its MW and pI. In the first dimension, proteins get separated on a gel using its isoelectric focusing (IEF) that separates proteins according to its pI. In the second dimension, the IEFseparated protein gets shifted to SDS-PAGE that separates the protein according to the MW. Therefore, understanding the MW and pI (2D) of fungal protein can be essential to hint at their basic biochemical principle. These biochemical details can be quite useful in predicting the function of a protein.
The proteins have diverse molecular mass and pIs (33,34). A protein's shape, size, solubility, MW and pI determine its ability to move across different cellular compartments and potential function (35)(36)(37)(38)(39)(40). The pI indicates the pH at which the net charge of a protein is zero (33). The dissociation constant (pK a ) of a polypeptide is determined by the presence of seven charged amino acids; arginine, aspartate, cysteine, glutamate, histidine, tyrosine and lysine (41)(42)(43). The N-terminal NH 2 -and C-terminal COOH groups of a protein influence the charge of a polypeptide (44)(45)(46)(47)(48). Molecular mass and pI have been used to determine the position of a protein sequence in a proteome map. It provides the required information to bioinformaticians and genome scientists seeking to understand the molecular basis, subcellular localization and function of a protein (49,50). Several attempts have been made to create a DB of experimentally validated proteins (51-56). However, it is difficult to experimentally validate every individual protein's pI and MW in a proteome. Hence, we elucidated the MW and pI of 685 fungal proteomes and constructed a DB so that users can use the DB and can get basic information of MW and pI of the fungal protein and their putative localization and function. The study included information on 7.127141 million protein sequences in total. The DB provides various searching and browsing options to explore 7.127141 million protein sequences efficiently. The DB (i) has a search engine that allows one to explore a virtual 2D map of the fungi proteome, (ii) species search by species name or any other attribute (i.e. the total number of protein sequences, the total number of acidic pI proteins, total number neutral pI proteins and total number basic pI proteins), (iii) search a specific species by accession number, and/or by protein name, and/or by whole sequence or subsequence, (iv), provide summary statistics of individual species as well as of the whole collection and (v) the interface also provides options to copy, save and print the retrieved information, with the ability to export the information in a variety of file types, including Excel, comma separated (csv) and pdf.

Dataset
Annotated protein sequences of 685 fungi species were downloaded from the NCBI, MycoCosm and Ensemble. The downloaded protein sequence files were used to calculate the MW and pI using a protein pI calculator (http://isoelectric.org/) (IPC Python) within a Linux-based platform (57). The IPC program provided the individual protein sequences' MW and pI. The results were subsequently processed using Microsoft Excel and Python library Pandas.

Construction of modules
The FungiProteomeDB has three main modules and hence relevant web pages. Other two pages are the home page and feedback page.

Construction of home page
The home page comprises a brief description of the DB https:// vision4research.com/fungidb/, updates and news feed about the fungi kingdom. The DB offers a user-friendly interface to search for specific protein information and their summary statistics.

Construction of summary page
Summary of whole fungi DB https://vision4research.com/ fungidb/ is described under four subsections: (i) Proteomes vs. proteins summary (ii) Proteomes vs. pI summary (iii) pI protein types summary (iv) Proteomes vs. MW summary.
It is a webpage with dynamic content. All the statistics were performed using DB manipulation language SQL on a single but large DB table comprising the data of all the 685 species. This DB design makes SQL-based data analysis more efficient as all the data are in a single denormalized table (in DB terminology). If we add or remove some information from the DB, all the statistics will be updated automatically in real-time, avoiding inconsistent data. At the cost of space, we achieved real-time efficiency in data analysis.

Construction of species page
Species module https://vision4research.com/fungidb/ provides the following information: (i) List of all species with a count of total proteins, which were further classified into total acidic, basic and neutral proteins. Note that it is a dynamically updated list in case of the addition or removal of some species An important feature of the page is the virtual 2D proteome map. A 2D proteome map is a scatter plot where each point is an ordered pair of pIs and MW. The data points of proteins were colored/classified according to pI values: (i) if 0.00 ≤ pI < 7.00, it is an acidic protein, plot it with red color; (ii) if pI = 7.00, it is a neutral protein, plot it with yellow color and (iii) if 7.00 < pI ≤ 14.00, it is basic protein, plot it with green color.

Construction of proteome page and database design
This is the core module of FungiProteomeDB that contains each species' detailed proteomic information and many search options. From the point of DB design, an individual DB table for each species was created, and each table had the following attributes for each protein: accession numbers, amino acid sequence, name of proteins, MW and pI.
The Proteome page has the following two main search modes/modules:  Figure 7 shows details about the multi-select search mode.
Both the substring and multi-select search modes of the proteome module have six further submodules or features. They are (i) search by species, (ii) search by accession number, (iii) search by protein name, (iv) search by sequence, (v) search by MW range and (vi) search by pI range.
The 'search by species' module allows selecting a specific species using a drop-down list or entering a partial name. Internally, it uses SQL wild card string search option.
The module 'search proteins by protein name' or 'keyword' provides the user with the ability to search for information on a protein using an accession number, protein domain name, MW or pI within a species or by a keyword related to the protein domain name. A user can save or print the information once retrieved from the DB. The interface has a list of species names and a keyword search option. A text box is provided for entering a protein domain name or keyword. Instructions for using the module are also provided. The interface also includes a window for displaying search results for a given species. By default, it will display two effects within the first species on the list for entered keywords. The interface has options for copying or printing the retrieved results.
(i) 'Sub-search option on retrieved search results': The subsearch option queries the table that is formed from retrieved data. Users can use the sub-search option to locate a specific entry or entries in the retrieved results. For example, in searching for a protein by name, a user can search a keyword of a protein domain name from the interface. (ii) 'Sorting options on each column': Users can sort any columns within the retrieved results table by using a mouse click.
In module 'search proteins by accession number' after selecting a species (otherwise, first species will be displayed by default), when the user starts entering the accession number, autocomplete functionality helps the user to select the accession number from the selected species. This interface also provides the ability to search the DB using the accession number, protein domain name, MW or pI within a specific species by using a single accession number or a list of accession numbers. The module has options that allow the user to copy, print or export the retrieved results as an Excel, pdf or csv file. The user can also save the retrieved information for subsequent use.
In the proteome module (page), to search 'proteins by protein name', the user has to select a species (otherwise, first species is by default selected). After selecting species, when the user starts entering a protein domain name, autocomplete functionality helps the user select a protein domain name.
In the module 'search by protein sequence', we provide the ability to search the DB using a protein sequence within a species using a list of accession numbers provided by the user. The interface has a list box of species names with a protein domain name or keyword search option. A text box is provided for entering a list of accession numbers, and instructions for using this module are also provided. The interface also includes a window for displaying sequences within a selected species name. By default, it will show two results within the first two accession numbers of the first species in the list.
The construction of the two modules-'search proteins by protein name' and 'accession number', was completed in two steps: (i) 'Preprocessing of the data': The protein data were collected from each species and placed in a Microsoft Excel file which was generated through IPC software (57). Accession number, protein name, MW and pI of each species were mentioned in it. The Excel files were then converted to csv files to import the data into MySQL. (ii) 'Database design': The DB was implemented in MySQL Server, a DB management system. There were 685 tables in the DB; the names and brief summary of each DB table are described in Table 1. After making the DB tables for all species, they were imported using a command prompt. The DB contains protein sequence tables for each of the 685 fungi species. The DB was designed to save all proteomic data of fungi species efficiently with minimum space and high performance. Table 1 shows the tables' names, purposes and dependencies.   The construction of the 'search protein sequence' was also completed in two steps: (i) 'Preprocessing of the data': For the development of this module, Fasta files were converted to a csv format using a python language script and then imported into the MySQL Server. The csv files were then compressed for efficient memory use.
In the proteome module (page), to 'search proteins by molecular weight (MW) range', the user has to select a species (otherwise, first species is by default selected). After selecting species, the user selects start and end values of MW from a range widget.
In the proteome module (page), to 'search proteins by isoelectric point (pI) range', the user has to select a species (otherwise, first species is by default selected). After selecting species, the user selects the start and end values of the pI from a range widget.

Front end and backend languages and tools
Languages and tools used in FungiProteomeDB are as follows:

Backend languages and tools
The backend uses the MariaDB DB to store all the data. PHP was used as a server-side scripting language that responds to client-side requests and interacts with the DB. An open-source and secure platform Codeigniter 3.1.11 was used to securely run PHP scripts. Summary results were calculated by running SQL queries directly to phpMyAdmin.

Front end languages and tools
JavaScript library of JQuery and its different extensions, i.e. JQuery-IU, DataTables and Select2, were used to make website HTML contents attractive and dynamic. Scatter plots (2D proteome map) were created using the JavaScript library convasjs.js. Different CSS libraries of JQuery and Bootstrap were also used to add beauty and attractiveness to web pages.

Results
In addition to the home page https://vision4research.com/ fungidb/, four more web pages were developed (Figure 1).

Time efficient search interface by species and their attributes
In all 685 species names, a total number of proteins were further divided into neutral, acidic and basic pI proteins. This information is fetched automatically from the DB table at the time of page load, hence providing the latest information. Species can be quickly searched by name or any  attribute available in the table by search box auto-focused and displayed at the top of the page (Figure 2).

Species attributes sorting
All the following attributes of species can be sorted in both ascending and descending orders by just clicking the heading of the row: (i) Serial or by default order number (ii) Name of the species (iii) Total number of protein sequences (iv) Number of acidic pI proteins (v) Number of neural pI proteins (vi) Number of basic pI proteins.

Species records per row
By default, there are 10 rows per page. Users can also change it to 25, 50 and 100 records per page. These customized data can be further used by the users for research with ease as it will provide information in a structured format (suitable for further analysis).

Species download in verities of formats
Users can copy the species in a clipboard or download it in csv, Excel, or pdf format. Moreover, a user can print the retrieved results.

Species pagination
Users can navigate previous, next or any other specific pages.

Virtual 2D proteome map
It can be viewed by clicking the 'View Map' button against a species name (Figure 2, seventh column). Figure 3 shows the 2D map of Sphaerobolus stellatus. It is a bi-modal distribution, showing less variation in acidic proteins regarding pI, and some values have more MW than neutral or acidic proteins. A few instances of neutral protein can be seen on a vertical line where pI = 7. The top left row of the map also shows the total proteins (#dots) represented in the map, and it was 35 181 in the case of S. stellatus.

Proteome search interface
There are four types of search options on each search mode in which proteins can be viewed. There are two search modes of (i) substring search and (ii) multi-select search on the proteome page. Figure 4 is the default view of the proteome page https://vision4research.com/fungidb/pages/proteomes when it loads. See Figures 5 and 6 for help with how to use this interface for substring search mode and Figures 7 and 8 for multi-select search mode.
(i) 'Proteome/species search': In both search modes, a single species was selected in this initial search. Figure 5A shows that a user can change species using auto-filling options or from a drop-down list. (ii) 'Proteome accession number search': In substring search mode, by default, all accession numbers of selected species will be selected, but the user can search specific proteome accession numbers ( Figure 5B). An autocomplete option is also provided here, in which the accession number will be shown and can be selected. A user can also search substrings on her/his will. Internally, this option of autofill was achieved with wild card search in SQL statement, '%substring%'. In multiselect search mode, by default, no accession number will be selected, but the user can select multiple accession numbers by multi-select menu. (iii) 'Search by protein name': In substring search mode, by default, all protein names of selected species will be selected, but the user can also search specific proteins by name. An autocomplete is provided in which the  protein domain name will be shown and can be selected. However, the user can also search for substrings if they want. The substring will fetch data in by wild card like '%substring%'. In multi-select search mode, by default, no protein name will be selected, but the user can select multiple protein names by multi-select menu. (iv) 'Search by sequence': In substring search mode, by default, all sequences of selected species will be selected, but the user can also search for specific protein sequences. An autocomplete is provided in which protein sequence will be shown and can be selected.
However, users can also search for substrings if they want. The substring will fetch data in by wild card like '%substring%'. In multi-select search mode, no sequence will be selected by default, but the user can select multiple sequences by the multi-select menu.    (iv) 'Pagination of retrieved results': A user can navigate the previous, next or specific pages.

Summary statistics
The module provides users with an overview of the overall statistics of the DB. The general statistics provided for the proteome of each species are: (i) Sequence count (ii) Average MW and average pI (per each proteome) (iii) Average MW and average pI (per each protein) (iv) Number of acidic, basic and neutral pI proteins  (v) Percentage of acidic, basic and neutral pI proteins.
The overall statistics of the FungiProteomeDB DB https:// vision4research.com/fungidb/ is provided in Table 2.

Discussion
Several DBs provide important information about different organisms' genomic and proteomic aspects. The proteomics DB (https://www.proteomicsdb.org/) reported by Wilhelm et al. described the mass spectrometry (MS)-based draft of the human proteome (58). They reported the presence of conserved controlled protein abundance when comparing the messenger ribonucleic acid and protein expression profiles (58). Furthermore, their analysis with integrated drug-sensitivity data enabled them to identify resistant or susceptible proteins for a particular drug (58). The proteome profile can also enable understanding the stoichiometry  (60). It uses the mass spectrometer output files from various organisms and searches using the latest search engines and protein sequences (60). The PeptideAtlas uses MS data of small peptides and enables them to map with the genome of the eukaryotic organism (60). A considerable analytical process with constant statistical validation leads to identifying peptides and proteins (60). The Arabidopsis Pep-tideAtlas was developed to harness worldwide proteomic data for comprehensive proteomic community resources (61). It provides proteomic information on post-translational modification and splice forms of specific proteins (61). The Pep-tideAtlas identified 17 858 unique proteins at the highest confidence level (61). The plant proteome DB (http://ppdb.tc. cornell.edu/) reports the experimental data of proteome and MS analysis. PlantMwpIDB reported the proteomic details of plant proteomes using proteins' MW and pI (62). It reports curated information on protein function, subcellular localization and protein properties (63). The fungal secretome DB (https://fsd.snu.ac.kr/) reported the secretary proteins of 158 fungal species comprising 208 883 proteins (64). It comprises 15.21% of the total proteome. Although these fungi-related DBs were constructed to elucidate the proteomic details, they were mainly based on experimental MS data. Therefore, it is challenging to elucidate the proteomic information of a large number of proteins. We used the MW and pI data to overcome the issue and construct the DB. This will enable us to find the proteins with acidic and basic pI proteins. The basic pI proteins usually reside in the basic pH range cellular compartment (65

Conclusion and future work
The proposed FungiProteomeDB allows researchers to retrieve information on the MW and pI of proteins within the proteomes of 685 fungi species. FungiProteomeDB is a comprehensive DB available for fungi proteomes and contains several modules for searching, retrieving and saving data. Future versions of FungiProteomeDB will make the DB more powerful for obtaining information on the proteome of the entire fungi kingdom. It will also include a protein molecular modeling module to decipher the 3D structure of each protein, target site prediction for metacaspases, palmitoylation, myristiylation and methylation for each protein. This additional information will provide important information to researchers investigating protein modification, function, structure and evolution. Currently (in the proteins search), only one species can be searched by different attributes. In our future version, any species will be searchable at a time by any attribute(s) number. Moreover, we want to add an option for registered users with admin privileges, who can upload new MWs and pIs of different species or a protein or its annotation. It will be part of a DB that automatically allows the submission of proteomic data and all the related information. We will also like to search and summarize unique biomarkers in the fungi kingdom [a patch of an amino acid subsequence of length n (n ≥ 2-5), which is present in the whole proteome file]. Currently, species sort by count values of protein or pI is provided; we would also like to add sort by sum values of pI or MW.

Data availability
All the data used in this manuscript are taken from the publicly available "National Center For Biotechnology Information" (NCBI) database and all the data can be found in our database.

Author contribution
T.K.M. conceived the idea, collected and calculated the MW and pI of proteins, analyzed and interpreted the data and drafted and revised the manuscript. M.R. and M.O. analyzed and interpreted the data, drafted and revised the manuscript, designed, constructed and tested the DB.